Constructing custom knowledgebases and sequence datasets with publications

ABSTRACT

Illustrative embodiments of custom knowledgebases and sequence datasets, as well as related methods, are disclosed. In one illustrative embodiment, one or more computer-readable media may comprise a custom knowledgebase and an associated sequence dataset. The custom knowledgebase may comprise a plurality of assertions that have been automatically extracted from a plurality of publications, where each of the plurality of assertions encodes a relationship between a subject and an object. The sequence dataset may comprise a plurality of called biological sequences, where each of the plurality of called biological sequences is associated with one or more of the plurality of assertions of the custom knowledgebase.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 14/280,285, filed May 16, 2014, the entire disclosure of which is hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates, generally, to custom knowledgebases and sequence datasets and, more particularly, to custom knowledgebases and sequence datasets that may be used to interrogate biological sequence data from metagenomic samples.

BACKGROUND

A knowledgebase is a technology used to store complex structured and/or unstructured information that may be used by a computing device (e.g., a knowledge-based system or expert system) to deduce new information. Knowledgebases often represent their stored information using an object model (sometimes called an “ontology”) with classes, subclasses, and instances. This ontology permits the representation of knowledge as a hierarchy of concepts within a particular domain, using a shared/controlled vocabulary to denote types, properties, and/or interrelationships associated with the information.

Some attempts have been made to develop knowledgebases in the areas of genetics and genomics. For instance, the Comprehensive Antibiotic Resistance Database (CARD), described in McArthur et al., “The Comprehensive Antibiotic Resistance Database,” Antimicrobial Agents and Chemotherapy, vol. 57, pp. 3348-3357 (2013), includes data describing antibiotics and their targets along with antibiotic resistance genes, associated proteins, and antibiotic resistance literature. The CARD utilizes an Antibiotic Resistance Ontology (ARO) for the classification of antibiotic resistance gene data. Existing knowledgebases in the areas of genetics and genomics, however, have typically relied entirely on subject matter experts to manually construct the ontologies used by the knowledgebases.

SUMMARY

The present invention may comprise any one or more of the features recited in the appended claims, any one or more of the following features, and/or any combinations thereof.

According to one aspect, a method may comprise automatically extracting a plurality of assertions from a plurality of publications, wherein each of the plurality of assertions encodes a relationship between a subject and an object, manually editing the plurality of assertions automatically extracted from the plurality of publications to construct a custom knowledgebase for a particular biological field, and constructing a sequence dataset comprising a plurality of called biological sequences, wherein each of the plurality of called biological sequences is associated with one or more of the plurality of assertions of the custom knowledgebase.

In some embodiments, manually editing the plurality of assertions automatically extracted from the plurality of publications may comprise at least one of (i) selecting a subset of the plurality of assertions automatically extracted from the plurality of publications for inclusion in the custom knowledgebase, (ii) modifying the content of one or more of the plurality of assertions automatically extracted from the plurality of publications for inclusion in the custom knowledgebase, and (iii) creating one or more additional assertions for inclusion in the custom knowledgebase. The manual editing of the plurality of assertions automatically extracted from the plurality of publications may be performed by one or more subject matter experts in the particular biological field.

In some embodiments, automatically extracting the plurality of assertions from the plurality of publications may comprise utilizing natural language processing software to derive the plurality of assertions from the text of the plurality of publications. The plurality of publications may comprise peer-reviewed articles selected by the subject matter experts. The natural language processing software may be trained by the subject matter experts to recognize relevant assertions in the text of the plurality of publications. Each of the plurality of assertions may be expressed as a Resource Description Framework (RDF) triple.

In some embodiments, constructing the sequence dataset may comprise automatically extracting one or more called biological sequences from the plurality of publications. Constructing the sequence dataset may further comprise extracting additional called biological sequences from one or more publicly available databases, grouping the additional called biological sequences with the one or more called biological sequences automatically extracted from the plurality of publications in response to one or more predetermined resemblance criteria being met, and associating each group of called biological sequences with one or more of the plurality of assertions of the custom knowledgebase. The plurality of called biological sequences included in the sequence dataset and the associations between the plurality of called biological sequences and the plurality of assertions of the custom knowledgebase may be manually edited by the subject matter experts.

According to another aspect, one or more computer-readable media may comprise a custom knowledgebase comprising a plurality of assertions that have been automatically extracted from a plurality of publications, wherein each of the plurality of assertions encodes a relationship between a subject and an object, and a sequence dataset comprising a plurality of called biological sequences, wherein each of the plurality of called biological sequences is associated with one or more of the plurality of assertions of the custom knowledgebase.

In some embodiments, the plurality of assertions automatically extracted from the plurality of publications may have been manually edited by one or more subject matter experts in a biological field of the custom knowledgebase. The one or more computer-readable media may further comprise a client application configured to compare a plurality of sample biological sequences to the plurality of called biological sequences of the sequence dataset and determine, for each sample biological sequence that resembles a called biological sequence of the sequence dataset, one or more probable characteristics associated with that sample biological sequence using one or more assertions of the custom knowledgebase that are associated with the called biological sequence that resembles that sample biological sequence.

In some embodiments, the plurality of called biological sequences of the sequence dataset comprise at least one of called biological sequences that provide resistance to one or more antibiotics and called biological sequences that mediate regulation of antibiotic resistance, and the plurality of assertions of the custom knowledgebase comprise assertions that encode relationships between the called biological sequences of the sequence dataset and at least one of antibiotic resistance elements and regulatory elements. The plurality of assertions of the custom knowledgebase may further comprise assertions that encode relationships between antibiotic resistance elements and particular resisted antibiotics.

According to yet another aspect, a method may comprise comparing a plurality of sample biological sequences to a plurality of called biological sequences included in a sequence dataset, retrieving, from a custom knowledgebase associated with the sequence dataset, one or more assertions that are associated with a called biological sequence of the sequence dataset that resembles one of the plurality of sample biological sequences, wherein the custom knowledgebase comprises a plurality of assertions that have been automatically extracted from a plurality of publications, each of the plurality of assertions encoding a relationship between a subject and an object, and determining one or more probable characteristics associated with the sample biological sequence that resembles the called biological sequence of the sequence dataset using the one or more assertions retrieved from the custom knowledgebase.

In some embodiments, the plurality of assertions automatically extracted from the plurality of publications may have been manually edited by one or more subject matter experts in a biological field of the custom knowledgebase. The method may further comprise generating the plurality of sample biological sequences using massively parallel sequencing of a metagenomic sample. Determining one or more probable characteristics associated with the sample biological sequence may comprise determining one or more antibiotics likely to be resisted. The method may further comprise generating a report that comprises a ranked listing of the antibiotics likely to be resisted.

BRIEF DESCRIPTION OF THE DRAWINGS

The concepts described in the present disclosure are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. The detailed description particularly refers to the accompanying figures in which:

FIG. 1 is a simplified block diagram illustrating one embodiment of an environment including a custom knowledgebase, a sequence dataset, and a client application;

FIG. 2 is a simplified flow diagram illustrating one embodiment of a method of constructing the custom knowledgebase and the sequence dataset of FIG. 1; and

FIG. 3 is a simplified flow diagram illustrating one embodiment of a method of using the client application, the sequence dataset, and the custom knowledgebase of FIG. 1 to interrogate sample biological sequence data.

DETAILED DESCRIPTION OF THE DRAWINGS

While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.

References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etcetera, indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Embodiments of the concepts described herein may be implemented in hardware, firmware, software, or any combination thereof. For instance, embodiments of the concepts described herein may be implemented as data and/or instructions carried by or stored on one or more machine-readable or computer-readable storage media, which may be read and/or executed by one or more processors. A machine-readable or computer-readable storage medium may be embodied as any device, mechanism, or physical structure for storing or transmitting information in a form readable by a machine (e.g., a computing device or system). For example, a machine-readable or computer-readable storage medium may be embodied as read only memory (ROM) device(s); random access memory (RAM) device(s); magnetic disk storage media; optical storage media; flash memory devices; mini- or micro-SD cards, memory sticks, and others.

In the drawings, specific arrangements or orderings of schematic elements, such as those representing devices, modules, software, and data elements, may be shown for ease of description. However, it should be understood by those skilled in the art that the specific ordering or arrangement of the schematic elements in the drawings is not meant to imply that a particular order or sequence of processing, or separation of processes, is required. Further, the inclusion of a schematic element in a drawing is not meant to imply that such element is required in all embodiments or that the features represented by such element may not be included in or combined with other elements in some embodiments.

In general, schematic elements used to represent software may be implemented using any suitable form of machine-readable instruction, such as software or firmware applications, programs, functions, modules, routines, processes, procedures, plug-ins, applets, widgets, code fragments and/or others, and each such instruction may be implemented using any suitable programming language, library, application programming interface (API), and/or other software development tools. For example, some embodiments may be implemented using Java, C++, and/or other programming languages. Similarly, schematic elements used to represent data or information may be implemented using any suitable electronic arrangement or structure, such as a register, data store, table, record, array, index, hash, map, tree, list, graph, file (of any file type), folder, directory, database, and/or others.

Further, in the drawings, where connecting elements, such as solid or dashed lines or arrows, are used to illustrate a connection, relationship or association between or among two or more other schematic elements, the absence of any such connecting elements is not meant to imply that no connection, relationship or association can exist. In other words, some connections, relationships or associations between elements may not be shown in the drawings so as not to obscure the disclosure. In addition, for ease of illustration, a single connecting element may be used to represent multiple connections, relationships or associations between elements. For example, where a connecting element represents a communication of signals, data, instructions, or other information, it should be understood by those skilled in the art that such element may represent one or multiple signal paths, as may be needed, to effect the communication.

The present disclosure relates to custom knowledgebases and sequence datasets that are constructed and curated using semi-automated methods. In particular, the knowledgebase may comprise assertions that are automatically extracted from the professional literature and then manually edited by subject matter experts in the particular biological field to which the knowledgebase is directed. Similarly, the sequence dataset associated with the custom knowledgebase may comprise called biological sequences (e.g., nucleotide sequences, protein sequences, etc.) that are automatically extracted from the professional literature (as well as other public sources) and associated with the assertions of the custom knowledgebase, subject to manual editing by the subject matter experts. Using the presently disclosed methods, a custom knowledgebase and an associated sequence dataset for antibiotic resistance have been constructed. In that illustrative embodiment, the antibiotic resistance knowledgebase contains assertions automatically extracted from over 800 peer-reviewed articles, while the antibiotic resistance sequence dataset contains over 3,800 biological sequence types and over 250,000 individual biological sequences.

Once constructed, the custom knowledgebases and sequence datasets of the present disclosure may be used to interrogate biological sequences that are read from metagenomic samples. For instance, using the illustrative antibiotic resistance knowledgebase and sequence dataset, a client application can identify antibiotic resistance elements in sample biological sequences and report on what antibiotic drugs are likely to be resisted as a result of the identified antibiotic resistance elements. As such, the illustrative antibiotic resistance knowledgebase and sequence dataset may support microbial biothreat identification, surveillance, and/or analysis tools that are rapid, accurate, and/or field-accessible/deployable. Similarly, the illustrative antibiotic resistance knowledgebase and sequence dataset may also be used to implement real-time and accurate infectious disease decision support tools for clinicians at the point-of-care. While many of the features of the present disclosure will be described with reference to the illustrative embodiment of a custom knowledgebase and sequence dataset for antibiotic resistance, it is contemplated that custom knowledgebases and sequence datasets according to the present disclosure might also be constructed and utilized to interrogate biological sequences for any number of characteristics, including, but not limited to, virulence elements, hydrocarbon-degrading enzymes, visible characteristics (e.g., in human genomes), and race performance factors (e.g., in horse genomes).

Referring now to FIG. 1, one illustrative embodiment of an environment including a custom knowledgebase 100, a sequence dataset 102, and a client application 104 is shown as a simplified block diagram. The custom knowledgebase 100 represents the knowledge of a particular biological field (e.g., antibiotic resistance) and is organized around an ontology 106 that is specific to that biological field. In other words, the custom knowledgebase 100 organizes the information needed to understand and represent that particular biological field with reference to the professional literature. In the illustrative embodiment, the custom knowledgebase 100 is embodied as data stored on one or more computer-readable media.

The custom knowledgebase 100 comprises a plurality of assertions 108, each of which encodes a relationship between a subject and an object, as illustrated in FIG. 1. In the illustrative embodiment, each of the assertions 108 is expressed as a Resource Description Framework (RDF) triple. As such, the assertions 108 have the form: subject→verb (or verb phrase)→object. The assertions 108 may encode any number of relationships, which will be dependent on the particular biological field represented by the custom knowledgebase 100 and the ontology 106 used. In the illustrative embodiment of the antibiotic resistance knowledgebase 100, by way of example, the assertions 108 may represent relationships such as “[subject] confers resistance to drug [object],” where the subject is a particular protein sequence or its encoding nucleotide sequence and the object is a particular antibiotic drug. The assertions 108 may also represent relationships with various antibiotic resistance elements and regulatory elements. For instance, some of the assertions 108 may encode a relationship between a biological sequence (or group of biological sequences) and an antibiotic resistance element or regulatory element, while other assertions 108 may encode a relationship between an antibiotic resistance element and a particular resisted antibiotic.
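By way of a non-limiting sketch, the subject→verb→object form of the assertions 108 might be held in memory as simple three-part records. The record type, the predicate strings, and every identifier below are hypothetical placeholders chosen for illustration; they are not taken from the ontology 106 or from any actual knowledgebase.

```python
from typing import List, NamedTuple


class Assertion(NamedTuple):
    """One subject -> predicate -> object assertion, in the style of an RDF triple."""
    subject: str
    predicate: str
    obj: str


# Hypothetical example assertions; the names are placeholders, not real data.
ASSERTIONS = [
    # A sequence group linked to a class of resistance element.
    Assertion("seq_group:example_enzyme", "is_a", "element:antibiotic_inactivating_enzyme"),
    # A sequence group linked to a particular resisted antibiotic.
    Assertion("seq_group:example_enzyme", "confers_resistance_to_drug", "drug:example_drug"),
    # A regulatory element linked to the sequence group it regulates.
    Assertion("seq_group:example_regulator", "regulates_expression_of", "seq_group:example_enzyme"),
]


def objects_of(subject: str, predicate: str, triples: List[Assertion]) -> List[str]:
    """Return every object asserted for the given subject and predicate."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]
```

A query such as `objects_of("seq_group:example_enzyme", "confers_resistance_to_drug", ASSERTIONS)` then returns the drugs asserted to be resisted by that sequence group.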

In the illustrative embodiment of the antibiotic resistance knowledgebase 100, the assertions 108 comprehensively describe the various classes of antibiotic resistance elements, including efflux pumps and their components, antibiotic inactivating enzymes, antibiotic target-altering enzymes, antibiotic target replacement proteins, proteins that result in reduced permeability to antibiotics, as well as sequence mutants that confer antibiotic resistance. The assertions 108 of the illustrative antibiotic resistance knowledgebase 100 also describe sequence elements that regulate expression of the types of resistance. Furthermore, the assertions 108 specify particular resisted antibiotic drugs for each type of antibiotic resistance. The relationships between the antibiotic resistance elements, regulatory elements, and antibiotic drugs are all described by the ontology 106.

As described in greater detail below (with reference to FIG. 2), the custom knowledgebase 100 is constructed and/or curated in a semi-automated manner. In particular, many of the assertions 108 of the custom knowledgebase 100 are generated automatically via extraction from a number of publications 110. In some embodiments, the publications 110 may be peer-reviewed articles from the relevant biological field that have been selected by subject matter experts in that field. As illustrated in FIG. 1, an extraction engine 112 may be used to digest the text of the publications 110 to derive the assertions 108 from the publications 110. For instance, the extraction engine 112 may analyze the text of the publications 110 for assertions 108 that fit the subject-relationship-object format and then encode each of these assertions 108 as an RDF triple. The assertions 108 derived by the extraction engine 112 may then be manually edited (e.g., by subject matter experts) to construct the custom knowledgebase 100. As discussed further below, this manual editing may involve associating an automatically extracted assertion 108 with a particular term of the ontology 106. It will be appreciated that, in contrast to prior art knowledgebases that have been generated by subject matter experts in an entirely manual fashion, the semi-automated construction and curation methods of the present disclosure offer significant time and cost savings and/or corresponding increases in the completeness of the custom knowledgebase 100.

The sequence dataset 102 comprises a plurality of called biological sequences 114 that are relevant to the biological field of the custom knowledgebase 100. In the illustrative embodiment, the sequence dataset 102 is embodied as data stored on one or more computer-readable media. Each of the called biological sequences 114 of the sequence dataset 102 is associated with one or more of the assertions 108 of the custom knowledgebase 100. In other words, each of the called biological sequences 114 is linked to one or more assertions 108 that describe that called biological sequence 114. In the illustrative embodiment, the called biological sequences 114 of the sequence dataset 102 are also grouped by types that may be described by the same assertion(s) 108. The associations between the called biological sequences 114 (or groups thereof) and the assertions 108 may be established automatically and/or manually by subject matter experts.

In the illustrative embodiment of the antibiotic resistance sequence dataset 102, the called biological sequences 114 include both called biological sequences 114 that provide resistance to one or more antibiotics and called biological sequences 114 that mediate regulation of antibiotic resistance. By way of example, the called biological sequences 114 of the antibiotic resistance sequence dataset 102 include protein sequences associated with resistance to particular antibiotics, as well as the encoding DNA sequences for those proteins. In some embodiments of the sequence dataset 102, some of the called biological sequences 114 may include adjoining or flanking sequences (in addition to the sequence elements directly associated with one or more assertions 108) to provide for more robust matching of sample biological sequences 118 to those called biological sequences 114.

Like the custom knowledgebase 100, the sequence dataset 102 may be constructed and/or curated in a semi-automated manner (as described in greater detail below with reference to FIG. 2). In particular, some of the called biological sequences 114 of the sequence dataset 102 may be extracted from the publications 110 (in some embodiments, at the same time the assertions 108 are extracted from the publications 110). As illustrated in FIG. 1, the extraction engine 112 may be used to digest the text of the publications 110 to extract the called biological sequences 114 from the text. For instance, when the extraction engine 112 detects an assertion 108 in one of the publications 110, the extraction engine 112 may then search for called biological sequences 114 set forth in the publication as examples of that assertion 108. The called biological sequences 114 found by the extraction engine 112 may then be manually edited (e.g., by subject matter experts) to construct the sequence dataset 102.

It is also contemplated that, in some embodiments, additional called biological sequences 114 may be automatically extracted from publicly available databases 116 (e.g., National Center for Biotechnology Information (NCBI) databases) and added to the sequence dataset 102. As described in greater detail below (with reference to FIG. 2), these additional called biological sequences 114 may be compared to the called biological sequences 114 extracted from the publications 110 to determine whether they sufficiently resemble one another. If the additional called biological sequences 114 and the called biological sequences 114 extracted from the publications 110 meet certain predetermined resemblance criteria, they may be grouped together and associated with the same assertion(s) 108 in the custom knowledgebase 100.
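The grouping described above may be sketched as follows, under stated assumptions. The disclosure does not specify the predetermined resemblance criteria; this sketch substitutes the similarity ratio from Python's standard `difflib.SequenceMatcher` and a hypothetical threshold of 0.9 as a stand-in for whatever alignment-based criteria an actual embodiment would apply. The function names and sequence identifiers are likewise illustrative only.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def resembles(seq_a: str, seq_b: str, min_ratio: float = 0.9) -> bool:
    """Stand-in resemblance criterion: difflib similarity ratio at or above a threshold."""
    # autojunk=False disables difflib's popularity heuristic, which could otherwise
    # ignore frequently occurring characters in long sequences.
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio() >= min_ratio


def group_with_publication_sequences(
    from_publications: Dict[str, str],
    from_databases: Dict[str, str],
    min_ratio: float = 0.9,
) -> Dict[str, List[str]]:
    """Attach each database sequence to the first publication sequence it resembles.

    Returns a mapping from each publication sequence ID to the IDs grouped with it.
    Database sequences that resemble no publication sequence are left ungrouped.
    """
    groups: Dict[str, List[str]] = {pub_id: [pub_id] for pub_id in from_publications}
    for db_id, db_seq in from_databases.items():
        for pub_id, pub_seq in from_publications.items():
            if resembles(pub_seq, db_seq, min_ratio):
                groups[pub_id].append(db_id)
                break
    return groups
```

Each resulting group may then be associated with the same assertion(s) 108 as the publication sequence that anchors it.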

The client application 104 interacts with the custom knowledgebase 100 and the sequence dataset 102 to infer information about sample biological sequences 118. The client application 104 may receive the sample biological sequences 118 from any number of sources (e.g., as part of a FASTA or FASTQ format computer file). As described in greater detail below (with reference to FIG. 3), the client application 104 may be configured to compare the sample biological sequences 118 to the called biological sequences 114 of the sequence dataset 102. Where a sample biological sequence 118 sufficiently resembles one of the called biological sequences 114 included in the sequence dataset 102, the client application 104 may then use the assertion(s) 108 of the custom knowledgebase 100 that are associated with that called biological sequence 114 to determine one or more probable characteristics associated with that sample biological sequence 118. In other words, the client application 104 may utilize the knowledge represented by the custom knowledgebase 100 and the sequence dataset 102 to predict characteristics that will be expressed in the sample from which the sample biological sequences 118 were read.
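A minimal sketch of this interrogation step is given below. It assumes sample sequences arrive as FASTA text, that called sequences and their associated assertions are held in plain dictionaries, and that resemblance is judged by a `difflib` similarity ratio with a hypothetical 0.9 threshold. None of these choices is prescribed by the present disclosure, and a practical embodiment would likely rely on a dedicated sequence alignment tool instead.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}; header lines start with '>'."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def interrogate(
    samples: Dict[str, str],
    called: Dict[str, str],
    assertions_for: Dict[str, List[Triple]],
    min_ratio: float = 0.9,
) -> Dict[str, List[Triple]]:
    """For each sample, collect the assertions linked to every called sequence it resembles."""
    results: Dict[str, List[Triple]] = {}
    for sample_id, sample_seq in samples.items():
        found: List[Triple] = []
        for called_id, called_seq in called.items():
            ratio = SequenceMatcher(None, sample_seq, called_seq, autojunk=False).ratio()
            if ratio >= min_ratio:
                found.extend(assertions_for.get(called_id, []))
        results[sample_id] = found
    return results
```

The assertions returned for each sample are the raw material from which probable characteristics, such as antibiotics likely to be resisted, may be determined.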

The client application 104 may generate a report 120 summarizing the results of interrogating one or more sample biological sequences 118, including the probable characteristic(s) determined to be associated with those sample biological sequences 118. In some embodiments, the report 120 may include a ranked listing of antibiotics that are likely to be resisted by the sample from which the sample biological sequences 118 were read. By way of illustrative example, the report 120 may list a number of antibiotics beginning with those with the most resistance elements present in the sample and concluding with those with the fewest (or no) resistance elements present in the sample. In some embodiments, the report 120 might also include the minimum inhibitory concentrations for the listed antibiotics and even citations (and/or hyperlinks) to relevant publications. It will be appreciated that many other formats for the report 120 are possible.
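One possible way to produce such a ranked listing is sketched below. It assumes that each hit is an (element, drug) pair already derived from the retrieved assertions and that ranking is by the number of distinct resistance elements found per antibiotic, with ties broken alphabetically. The tie-breaking rule and all names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rank_antibiotics(
    hits: Iterable[Tuple[str, str]],
    antibiotics: Iterable[str],
) -> List[Tuple[str, int]]:
    """Rank antibiotics by the count of distinct resistance elements found in the sample.

    `hits` holds (resistance element, resisted antibiotic) pairs; `antibiotics`
    lists every antibiotic to report, so those with no elements appear with 0.
    """
    elements_by_drug: Dict[str, Set[str]] = {drug: set() for drug in antibiotics}
    for element, drug in hits:
        elements_by_drug.setdefault(drug, set()).add(element)
    counts = [(drug, len(elements)) for drug, elements in elements_by_drug.items()]
    # Most elements first; alphabetical order breaks ties (an assumed convention).
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

Minimum inhibitory concentrations or citations, where available, could be carried alongside each entry in the same listing.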

In the illustrative embodiment, the client application 104 is embodied as software instructions stored on one or more computer-readable media (which may be executed by one or more processors). The client application 104 may provide a custom graphical user interface (GUI) to users of the custom knowledgebase 100 and the sequence dataset 102 that allows the users to create new reports, access old reports, store reports, and keep track of different cases based on particular metagenomic sequence samples.

Referring now to FIG. 2, one illustrative embodiment of a method 200 of constructing the custom knowledgebase 100 and the sequence dataset 102 is shown as a simplified flow diagram. The method 200 is illustrated as a number of blocks 202-218. Although the blocks 202-218 are generally shown and described sequentially in the present disclosure, it will be appreciated that the blocks 202-218 do not necessarily need to be performed in a particular order (unless otherwise noted below). For instance, it is contemplated that many of the blocks 202-218 might be performed in parallel with other blocks during the method 200.

The method 200 begins with block 202 in which the extraction engine 112 is trained to recognize assertions 108 relevant to a particular biological field in the text of the publications 110. In some embodiments, block 202 may involve subject matter experts (and/or others) providing the extraction engine 112 with examples of relevant assertions 108. Block 202 might also involve subject matter experts (and/or others) reviewing the results of previous attempts by the extraction engine 112 to extract assertions 108 from the text of publications 110 and providing feedback to the extraction engine 112 to improve its performance. In other words, it is contemplated that, in some embodiments, the blocks 202-206 may be performed iteratively as part of training the extraction engine 112 to recognize assertions 108 relevant to the particular biological field. In block 204, one or more publications 110 are selected to be input to the extraction engine 112 for the extraction of assertions 108 from the text of those publications 110. In some embodiments, block 204 may involve subject matter experts selecting peer-reviewed articles 110 from the relevant biological field that should be input to the extraction engine 112.

After block 204, the method 200 proceeds to block 206 in which the extraction engine 112 automatically extracts a plurality of assertions 108 from the publications 110. In some embodiments, the extraction engine 112 may include natural language processing software 112 to derive the assertions 108 from the text of the publications 110. In one illustrative embodiment, the natural language processing software 112 may be embodied as the K-Platform Extractor™, commercially available from Lymba Corporation of Richardson, Texas. As discussed above, in the illustrative embodiment, each of the assertions 108 extracted by the natural language processing software 112 is expressed as an RDF triple that encodes a relationship between a subject and an object (see FIG. 1). In some embodiments of method 200, block 206 may also involve automatically extracting one or more called biological sequences 114 from the publications 110. For instance, when the extraction engine 112 detects an assertion 108 in one of the publications 110, the extraction engine 112 may then search for called biological sequences 114 set forth in the publication as examples of that assertion 108.
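Purely to illustrate the kind of output produced in block 206, the sketch below pulls subject-relationship-object triples out of text with a single regular expression. This is a deliberately naive stand-in and is not how trained natural language processing software, such as the commercial product named above, operates. The pattern, the function name, and the sample sentence are all hypothetical.

```python
import re
from typing import List, Tuple

# Naive pattern for one relationship only: "<subject> confers resistance to <object>".
# Real extraction would rely on trained NLP software rather than a fixed pattern.
RESISTANCE_PATTERN = re.compile(
    r"(?P<subject>[A-Za-z0-9_-]+) confers resistance to (?P<object>[A-Za-z0-9_-]+)"
)


def extract_assertions(text: str) -> List[Tuple[str, str, str]]:
    """Return (subject, predicate, object) triples for every pattern match in the text."""
    return [
        (match["subject"], "confers resistance to", match["object"])
        for match in RESISTANCE_PATTERN.finditer(text)
    ]
```

For a sentence such as "geneA confers resistance to drugX.", the function yields the single triple `("geneA", "confers resistance to", "drugX")`, which could then be passed on for manual editing.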

After block 206, the method 200 proceeds to block 208 in which the custom knowledgebase 100 is constructed using the assertions 108 that were automatically extracted from the publications 110 (during block 206). As illustrated in FIG. 2, block 208 also involves block 210, in which the automatically extracted assertions 108 are manually edited (e.g., by one or more subject matter experts in the particular biological field to which the custom knowledgebase 100 is directed). It is contemplated that the manual editing of the assertions 108 in block 210 may involve a number of tasks, including, but not limited to, selecting a subset of the assertions 108 automatically extracted from the publications 110 for inclusion in the custom knowledgebase 100 (or, alternatively, deleting the assertions 108 that should not be included in the custom knowledgebase 100), modifying the content of one or more of the assertions 108 automatically extracted from the publications 110, and/or creating one or more additional assertions 108 for inclusion in the custom knowledgebase 100.
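The three editing tasks named above (selecting, modifying, and creating assertions) may be sketched as a single function that applies an expert's decisions to the extracted list. How those decisions are captured is not specified by the present disclosure; the index-based arguments and all names below are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def apply_manual_edits(
    extracted: List[Triple],
    keep: Set[int],
    modifications: Dict[int, Triple],
    additions: Iterable[Triple],
) -> List[Triple]:
    """Build the curated assertion list from expert decisions.

    keep          -- indices of extracted assertions selected for inclusion
    modifications -- replacement content for selected assertions, keyed by index
    additions     -- new assertions written by the experts
    """
    curated: List[Triple] = []
    for index, triple in enumerate(extracted):
        if index not in keep:
            continue  # assertion rejected by the experts
        curated.append(modifications.get(index, triple))
    curated.extend(additions)
    return curated
```

Keeping the extracted list unchanged and deriving the curated list from it preserves a record of what the extraction engine originally produced.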

After block 206, the method 200 also proceeds to block 212 in which the sequence dataset 102 is constructed. In the illustrative embodiment shown in FIG. 2, block 212 involves constructing the sequence dataset 102 using the called biological sequences 114 that were automatically extracted from the publications 110 during block 206. In some embodiments, block 212 may also involve blocks 214 and 216. In block 214, additional called biological sequences 114 are extracted from one or more publicly available databases 116 (e.g., an NCBI database). After block 214, the method 200 proceeds to block 216 in which the additional called biological sequences 114 extracted from the databases 116 are compared to the called biological sequences 114 extracted from the publications 110. Where one or more predetermined resemblance criteria between the additional called biological sequences 114 extracted from the databases 116 and the called biological sequences 114 extracted from the publications 110 are met, these called biological sequences 114 are grouped together.
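The grouping of block 216 might be sketched as follows. The disclosure does not specify the resemblance criteria, so this sketch assumes a single criterion: a similarity ratio from Python's standard `difflib.SequenceMatcher` that meets a predetermined minimum. That ratio is a simple stand-in chosen for illustration, not a biological alignment score, and the names `resembles`, `group_sequences`, and `min_ratio` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, List


def resembles(a: str, b: str, min_ratio: float = 0.9) -> bool:
    """Assumed resemblance criterion: similarity ratio of at least min_ratio."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= min_ratio


def group_sequences(
    from_publications: Iterable[str],
    from_databases: Iterable[str],
    min_ratio: float = 0.9,
) -> Dict[str, List[str]]:
    """Group database sequences with the publication sequence they resemble."""
    # Each publication sequence seeds its own group.
    groups = {seq: [seq] for seq in from_publications}
    for candidate in from_databases:
        for seed, members in groups.items():
            if resembles(seed, candidate, min_ratio):
                members.append(candidate)
                break  # place each database sequence in at most one group
    return groups
```

In this sketch a database sequence that resembles none of the publication sequences is simply left out of every group; other embodiments could handle such sequences differently.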

After blocks 208 and 212 (and/or, in some embodiments, during blocks 208, 212), the method 200 proceeds to block 218 in which each called biological sequence 114 (or group of called biological sequences 114) is associated with one or more of the assertions 108 of the custom knowledgebase 100. In some embodiments, block 218 may involve manual editing of the associations between the called biological sequences 114 and the assertions 108 of the custom knowledgebase 100 (e.g., by subject matter experts). In other embodiments, the associations of the called biological sequences 114 with the assertions 108 may be partially or fully automated. For instance, an association between an assertion 108 and a called biological sequence 114 that are both automatically extracted from a publication 110 during block 206 may be maintained throughout the method 200.

While the method 200 has generally been described above in terms of newly constructing the custom knowledgebase 100 and the sequence dataset 102, it will be appreciated that the method 200 may also be utilized to curate or update the custom knowledgebase 100 and the sequence dataset 102 on an ongoing basis. For instance, new publications 110 may periodically be input to the extraction engine 112 to extract new assertions 108 and called biological sequences 114 in order to keep the custom knowledgebase 100 and the sequence dataset 102 up-to-date. Similarly, as additional called biological sequences 114 are periodically added to the publicly available databases 116, these additional called biological sequences 114 may be added to the sequence dataset 102.

Referring now to FIG. 3, one illustrative embodiment of a method 300 of using the client application 104, the sequence dataset 102, and the custom knowledgebase 100 to interrogate the sample biological sequences 118 is shown as a simplified flow diagram. The method 300 is illustrated as a number of blocks 302-310. Although the blocks 302-310 are generally shown and described sequentially in the present disclosure, it will be appreciated that the blocks 302-310 do not necessarily need to be performed in a particular order (unless otherwise noted below). For instance, it is contemplated that many of the blocks 302-310 might be performed in parallel with other blocks during the method 300.

The method 300 begins with optional block 302 in which a plurality of sample biological sequences 118 are generated using MPS of a metagenomic sample. In other embodiments, where a data file (e.g., a FASTA or FASTQ format file) containing sample biological sequences 118 is received, the optional block 302 need not be performed as part of the method 300. In either case, the client application 104 receives sample biological sequences 118 (either from an MPS instrument or from a data file) prior to proceeding to block 304.

In block 304, the client application 104 communicates with the sequence dataset 102 to compare the sample biological sequences 118 to the called biological sequences 114 included in the sequence dataset 102. As a result of the comparisons performed in block 304, the client application 104 determines whether any of the sample biological sequences 118 resembles, or “matches,” one or more of the called biological sequences 114. In some illustrative embodiments, an alignment algorithm (e.g., the BLAST algorithm) may be used to determine a degree of resemblance between each sample biological sequence 118 and each of the called biological sequences 114 included in the sequence dataset 102. Each sample biological sequence 118 may be “matched” to the called biological sequence 114 with the highest degree of resemblance, assuming the resemblance exceeds a threshold value. Alternatively, if a sample biological sequence 118 does not sufficiently resemble any of the called biological sequences 114, the client application 104 may determine that the sample biological sequence 118 has no matches in the sequence dataset 102. In other embodiments, the client application 104 may require exact matching between the sample biological sequences 118 and the called biological sequences 114 during block 304.
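The best-match selection of block 304 might be sketched as follows. This is not the BLAST algorithm; as an assumption made purely to keep the example self-contained, the degree of resemblance is computed with Python's standard `difflib.SequenceMatcher`. The function name `best_match` and the default threshold value are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def best_match(
    sample: str, called: Iterable[str], threshold: float = 0.8
) -> Optional[str]:
    """Return the called sequence most resembling the sample, or None.

    A candidate is accepted only if its resemblance exceeds the threshold.
    """
    best, best_score = None, threshold
    for candidate in called:
        # Stand-in resemblance score in [0, 1]; an alignment tool could be used instead.
        score = SequenceMatcher(None, sample, candidate, autojunk=False).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best
```

A return value of `None` corresponds to the case in which the sample biological sequence has no matches in the sequence dataset.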

For each sample biological sequence 118 that is determined to resemble one of the called biological sequences 114 (during block 304), the method 300 proceeds to block 306 in which the client application 104 communicates with the custom knowledgebase 100 to retrieve one or more of the assertions 108. In particular, during block 306, the assertion(s) 108 that are associated with the called biological sequence 114 determined to resemble the sample biological sequence 118 are retrieved.

After block 306, the method 300 proceeds to block 308 in which the assertions 108 retrieved from the custom knowledgebase 100 (in block 306) are used to determine one or more probable characteristics associated with the sample biological sequence 118. The resemblance between the sample biological sequence 118 and the called biological sequence 114, in combination with the assertions 108 associated with the called biological sequence 114, allows the custom knowledgebase 100 to be used to infer information about the sample biological sequence 118. In the illustrative embodiment, block 308 may involve determining one or more antibiotics likely to be resisted by the sample from which the sample biological sequences 118 were read. Like block 306, block 308 is performed for each sample biological sequence 118 that was determined (in block 304) to resemble one of the called biological sequences 114.
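Blocks 306 and 308 might be sketched together as a lookup followed by a filter. The sketch assumes that the matches from block 304 are available as a mapping from sample sequence identifiers to called sequence identifiers, and that resistance is encoded with a predicate named `confers_resistance_to`. That predicate name, the identifiers, and the function `probable_resistances` are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Assertion(NamedTuple):
    """One RDF-style triple retrieved from the knowledgebase."""
    subject: str
    predicate: str
    obj: str


RESISTANCE = "confers_resistance_to"  # hypothetical predicate name


def probable_resistances(
    matches: Dict[str, str],
    associations: Dict[str, List[Assertion]],
) -> Dict[str, Set[str]]:
    """Map each matched sample sequence to the antibiotics it likely resists.

    matches: sample sequence id -> id of the called sequence it resembles.
    associations: called sequence id -> assertions associated with it.
    """
    resisted: Dict[str, Set[str]] = {}
    for sample_id, called_id in matches.items():
        # Retrieve the assertions tied to the matched called sequence.
        for assertion in associations.get(called_id, []):
            # Keep only assertions that describe antibiotic resistance.
            if assertion.predicate == RESISTANCE:
                resisted.setdefault(sample_id, set()).add(assertion.obj)
    return resisted
```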

The method 300 may conclude with optional block 310 in which a report 120 is generated that includes the probable characteristic(s) determined to be associated with the sample biological sequences 118 (in block 308). In the illustrative embodiment, the report 120 includes a ranked listing of antibiotics that are likely to be resisted by the sample (as determined in block 308). As noted above, the report 120 may list a number of antibiotics beginning with those with the most resistance elements present in the sample and concluding with those with the fewest (or no) resistance elements present in the sample. It will be appreciated that, in other embodiments, alternative formats for the report 120 may be used.
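One way the ranked listing of the report 120 might be produced is sketched below. The sketch assumes that each resistance element detected in the sample contributes one entry (an antibiotic name) to a list of hits, and that a list of all antibiotics of interest is available so that antibiotics with no resistance elements still appear at the end. The function `ranked_report` and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ranked_report(
    resistance_hits: Iterable[str], all_antibiotics: Iterable[str]
) -> List[Tuple[str, int]]:
    """Rank antibiotics from most to fewest resistance elements in the sample."""
    # Start every antibiotic at zero so those with no hits are still listed.
    counts = Counter({name: 0 for name in all_antibiotics})
    # Each hit is one resistance element found for the named antibiotic.
    counts.update(resistance_hits)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```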

While certain illustrative embodiments have been described in detail in the figures and the foregoing description, such an illustration and description is to be considered as exemplary and not restrictive in character, it being understood that only illustrative embodiments have been shown and described and that all changes and modifications that come within the spirit of the disclosure are desired to be protected. There are a plurality of advantages of the present disclosure arising from the various features of the methods, systems, and articles described herein. It will be noted that alternative embodiments of the methods, systems, and articles of the present disclosure may not include all of the features described yet still benefit from at least some of the advantages of such features. Those of ordinary skill in the art may readily devise their own implementations of the methods, systems, and articles that incorporate one or more of the features of the present disclosure.

1-20. (canceled)
21. A method comprising: automatically extracting a plurality of assertions from a plurality of publications, wherein each of the plurality of assertions encodes a relationship between a subject and an object; automatically extracting one or more called biological sequences from the plurality of publications; extracting additional called biological sequences from one or more publicly available databases; grouping the additional called biological sequences with the one or more called biological sequences automatically extracted from the plurality of publications in response to one or more resemblance criteria being met; and associating each group of called biological sequences with one or more of the plurality of assertions.
22. The method of claim 21, wherein each of the one or more resemblance criteria is predetermined.
23. The method of claim 21, wherein automatically extracting the plurality of assertions from the plurality of publications comprises utilizing natural language processing software to derive the plurality of assertions from the text of the plurality of publications.
24. The method of claim 23, wherein the plurality of publications comprises peer-reviewed articles selected by subject matter experts in a field associated with the peer-reviewed articles.
25. The method of claim 23, wherein the natural language processing software has been trained by subject matter experts in a field associated with the plurality of publications to recognize relevant assertions in the text of the plurality of publications.
26. The method of claim 23, wherein each of the plurality of assertions is expressed as a Resource Description Framework (RDF) triple.
27. The method of claim 21, further comprising manually editing the associations between each group of called biological sequences and the plurality of assertions by subject matter experts in a field associated with the plurality of publications.
28. One or more tangible non-transitory computer-readable media comprising a plurality of instructions that, when executed by a computing device, cause the computing device to: automatically extract a plurality of assertions from a plurality of publications, wherein each of the plurality of assertions encodes a relationship between a subject and an object; automatically extract one or more called biological sequences from the plurality of publications; extract additional called biological sequences from one or more publicly available databases; group the additional called biological sequences with the one or more called biological sequences automatically extracted from the plurality of publications in response to one or more resemblance criteria being met; and associate each group of called biological sequences with one or more of the plurality of assertions.
29. The one or more tangible non-transitory computer-readable media of claim 28, wherein each of the one or more resemblance criteria is predetermined.
30. The one or more tangible non-transitory computer-readable media of claim 28, wherein to automatically extract the plurality of assertions from the plurality of publications comprises to utilize natural language processing software to derive the plurality of assertions from the text of the plurality of publications.
31. The one or more tangible non-transitory computer-readable media of claim 30, wherein the plurality of publications comprises peer-reviewed articles selected by subject matter experts in a field associated with the peer-reviewed articles.
32. The one or more tangible non-transitory computer-readable media of claim 30, wherein the natural language processing software has been trained by subject matter experts in a field associated with the plurality of publications to recognize relevant assertions in the text of the plurality of publications.
33. The one or more computer-readable media of claim 30, wherein each of the plurality of assertions is expressed as a Resource Description Framework (RDF) triple.
34. A method comprising: comparing a plurality of sample biological sequences to a plurality of called biological sequences included in a sequence dataset; retrieving, from a custom knowledgebase associated with the sequence dataset, one or more assertions that are associated with a called biological sequence of the sequence dataset that resembles one of the plurality of sample biological sequences, wherein the one of the plurality of sample biological sequences is not in the sequence dataset; and determining one or more probable characteristics associated with the sample biological sequence that resembles the called biological sequence of the sequence dataset using the one or more assertions retrieved from the custom knowledgebase.
35. The method of claim 34, further comprising generating the plurality of sample biological sequences using massively parallel sequencing of a metagenomic sample.
36. The method of claim 34, wherein determining one or more probable characteristics associated with the sample biological sequence comprises determining one or more antibiotics likely to be resisted.
37. The method of claim 36, further comprising generating a report that comprises a ranked listing of the antibiotics likely to be resisted.
38. The method of claim 34, wherein each of the one or more assertions is expressed as a Resource Description Framework (RDF) triple.
39. The method of claim 34, wherein the plurality of called biological sequences of the sequence dataset comprise at least one of called biological sequences that provide resistance to one or more antibiotics and called biological sequences that mediate regulation of antibiotic resistance.
40. The method of claim 39, wherein the one or more assertions of the custom knowledgebase comprise assertions that encode relationships between the called biological sequences of the sequence dataset and at least one of antibiotic resistance elements and regulatory elements.